
Machine Code: The CPU's Native Language
In the journey of understanding how computers fundamentally work, especially when exploring the idea of "building a computer from scratch," one must eventually confront machine code. This is the absolute lowest level of programming language directly understood and executed by the Central Processing Unit (CPU). It's the language of the hardware itself.
What is Machine Code?
Machine Code: The computer code consisting of machine language instructions. It is the numerical, binary representation of a computer program that is directly read and interpreted by the Central Processing Unit (CPU).
For a conventional binary computer, machine code is a sequence of bits (0s and 1s) that tells the CPU exactly what operations to perform. A program in machine code is essentially a list of these binary instructions, possibly interspersed with the data they operate on.
Think of machine code as the raw electrical signals (represented by high/low voltage, or 1s and 0s) that directly control the logic gates and circuits within the CPU. When you write code in a language like Python or C++, it eventually needs to be translated into this fundamental binary language for the CPU to execute it.
The CPU and Machine Instructions
The CPU is the "brain" of the computer, responsible for executing instructions. Each machine code instruction causes the CPU to perform a very specific, atomic task. These tasks are the fundamental operations the CPU is designed to carry out.
Examples of basic tasks performed by single machine instructions include:
- Data Movement: Copying a value from a memory location into a high-speed storage area within the CPU called a register, or vice versa.
  - Example: LOAD R1, memory_address_X (Conceptually: get the value at memory address X and put it in Register 1).
- Arithmetic and Logic Operations: Performing calculations or logical comparisons using the CPU's Arithmetic Logic Unit (ALU). These operations are typically performed on data held in registers or memory locations.
  - Example: ADD R1, R2, R3 (Conceptually: Add the values in Register 2 and Register 3 and put the result in Register 1).
- Control Flow Changes: Altering the sequence of instruction execution, such as jumping to a different part of the program or conditionally skipping the next instruction based on a condition (like the result of a comparison). This is how programs make decisions and repeat actions (loops). A small simulation of these operations appears after this list.
  - Example: JUMP label_Y (Conceptually: Go to the instruction located at 'label_Y').
  - Example: BRANCH_IF_ZERO label_Z, R1 (Conceptually: If the value in Register 1 is zero, go to the instruction at 'label_Z'; otherwise, continue to the next instruction in sequence).
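To make these conceptual operations concrete, here is a minimal sketch of a toy interpreter written in Python. The mnemonics, registers, and the little program it runs are invented purely for illustration and do not correspond to any real CPU's instruction set.

```python
# A toy interpreter for the conceptual instructions above.
# Everything here (mnemonics, registers, program) is illustrative, not a real ISA.

memory = {100: 7, 101: 0}          # a tiny "RAM": address -> value
registers = {"R1": 0, "R2": 0}     # a tiny register file
program = [
    ("LOAD", "R1", 100),           # R1 <- memory[100]
    ("LOAD", "R2", 101),           # R2 <- memory[101]
    ("ADD", "R1", "R1", "R2"),     # R1 <- R1 + R2
    ("BRANCH_IF_ZERO", 6, "R2"),   # if R2 == 0, jump ahead to instruction 6
    ("ADD", "R1", "R1", "R2"),     # (skipped when R2 is zero)
    ("JUMP", 6),
    ("HALT",),
]

pc = 0                             # program counter: index of the next instruction
while True:
    op, *args = program[pc]
    if op == "HALT":
        break
    elif op == "LOAD":             # data movement: memory -> register
        reg, addr = args
        registers[reg] = memory[addr]
        pc += 1
    elif op == "ADD":              # arithmetic performed by the "ALU"
        dst, a, b = args
        registers[dst] = registers[a] + registers[b]
        pc += 1
    elif op == "JUMP":             # unconditional control-flow change
        pc = args[0]
    elif op == "BRANCH_IF_ZERO":   # conditional control-flow change
        target, reg = args
        pc = target if registers[reg] == 0 else pc + 1

print(registers)                   # {'R1': 7, 'R2': 0}
```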
Architecture Dependence
One crucial aspect of machine code is that it is architecture-specific.
Instruction Set Architecture (ISA): A specification that defines the set of instructions a particular processor or family of processors can execute. It includes the instruction set, instruction formats, register set, memory addressing modes, and other characteristics of the processor's instruction-level interface.
Every different CPU family (like x86 used in most PCs, ARM used in smartphones and many newer computers, or the MIPS architecture often used in embedded systems and for teaching) has its own unique Instruction Set Architecture (ISA). This means the binary patterns that represent instructions for an x86 CPU are different from those for an ARM CPU.
For example, the binary code 10110000 01100001 (hexadecimal B0 61) means "move the value 97 into the AL register" on an x86 processor, but it would mean something completely different (or be an invalid instruction) on an ARM processor.
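As a quick sanity check of that x86 example, the two bytes can be pulled apart with a few lines of Python. The breakdown assumes the standard x86 encoding in which the opcode bytes 0xB0 through 0xB7 mean "move an 8-bit immediate into a register," with the low three bits of the opcode selecting the register (0 selects AL).

```python
# The x86 instruction bytes from the example: 10110000 01100001 (0xB0 0x61).
code = bytes([0b10110000, 0b01100001])

opcode_byte, immediate = code[0], code[1]

# Opcodes 0xB0-0xB7 encode "MOV r8, imm8": the low three bits select the
# 8-bit register, and register number 0 is AL.
assert opcode_byte & 0b11111000 == 0b10110000
register_number = opcode_byte & 0b00000111

print("opcode family: MOV r8, imm8")
print(f"destination register number: {register_number}  (0 = AL)")
print(f"immediate value: {immediate}")   # 97, the ASCII code for 'a'
```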
There are some rare exceptions where processors might support instructions from more than one ISA (like the VAX supporting PDP-11 instructions or the PowerPC 615 supporting x86), but these are not common for general-purpose computing today.
The Numerical Language and Its Challenges
Machine code is a strictly numerical language. It's just patterns of bits. This makes it the most direct way to communicate with the CPU. However, this numerical nature is what makes writing programs directly in machine code extremely difficult for humans.
Imagine trying to write a complex program by hand, specifying every single instruction as a sequence of 0s and 1s, and manually calculating memory addresses in binary or hexadecimal. This process is incredibly tedious, error-prone, and very difficult to read or debug.
While theoretically possible, programs are rarely written directly in machine code. Its primary role today is as the target output for compilers and assemblers. However, understanding machine code is essential for low-level programming, debugging, and performance optimization, especially when you don't have access to the original source code (e.g., reverse engineering or patching existing binaries).
From High-Level to Machine Code
Most software development today happens using high-level programming languages (like Python, Java, C++, etc.). These languages are designed to be human-readable and abstract away the complexities of the underlying hardware.
Compiler: A program that translates source code written in a high-level programming language into a lower-level language, such as assembly language or machine code.
A compiler is the essential tool that translates the human-readable source code of a high-level language program into the specific machine code for the target CPU architecture. This compiled machine code is what the computer's operating system then loads into memory and tells the CPU to execute.
The Relationship with Assembly Language
To bridge the gap between raw machine code and human readability, assembly language was created.
Assembly Language: A low-level programming language that uses mnemonic codes and symbolic names to represent machine code instructions and memory locations. There is typically a one-to-one correspondence between assembly language instructions and machine code instructions for a specific architecture.
Instead of using the numerical (binary or hexadecimal) value of an instruction, assembly language uses short, human-readable mnemonics. For example, on the x86 architecture, the machine code 0x90 (hexadecimal for 10010000 binary) is represented by the mnemonic NOP (No Operation) in assembly language.
Similarly, instead of using raw memory addresses like 0x1A4C, assembly language allows programmers to use symbolic names (labels) like player_score. The process of converting assembly language into machine code is done by a program called an assembler.
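The core job of an assembler, replacing each mnemonic with its numeric opcode and each label with an address, can be sketched in a few lines of Python. Apart from NOP (0x90) and HLT (0xF4), which are genuine x86 opcodes, the miniature one-byte-per-instruction format below is invented for illustration; a real assembler also has to handle operands, expressions, and relocation.

```python
# A minimal two-pass "assembler" sketch for a made-up one-byte instruction format.
# NOP (0x90) and HLT (0xF4) are real x86 opcodes; the rest of the scheme is invented.
OPCODES = {"NOP": 0x90, "HLT": 0xF4}

source = [
    "start:",      # a label naming the address of the next instruction
    "NOP",
    "NOP",
    "HLT",
]

# Pass 1: record the address of every label.
labels, address = {}, 0
for line in source:
    if line.endswith(":"):
        labels[line[:-1]] = address
    else:
        address += 1               # every instruction here is one byte long

# Pass 2: emit machine code bytes.
machine_code = bytes(OPCODES[line] for line in source if not line.endswith(":"))

print(labels)                      # {'start': 0}
print(machine_code.hex(" "))       # 90 90 f4
```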
| Feature | Machine Code | Assembly Language | High-Level Language |
| --- | --- | --- | --- |
| Representation | Binary/Numerical (0s/1s) | Mnemonics, Symbols | Human-readable syntax |
| Readability | Very Low | Low | High |
| CPU Understanding | Direct | Requires Assembler | Requires Compiler |
| Abstraction | None (Direct Hardware Control) | Low (Direct Machine Instruction Map) | High (Abstracts Hardware) |
| Architecture | Specific | Specific (1:1 map) | Often Portable (with Compiler) |
Assembly language is still low-level and architecture-specific, but it's significantly easier for humans to read, write, and understand compared to raw machine code. It provides a direct window into the operations the CPU is performing.
Exploring the Instruction Set
The instruction set is the complete collection of operations that a specific CPU can perform. It's the vocabulary of the machine language for that CPU.
Instruction Format
Each instruction in the instruction set has a specific format, which is a pattern of bits. This format dictates how the bits within the instruction are divided into fields, where each field represents a different piece of information the CPU needs to execute the instruction.
Common ways instruction formats can differ across architectures:
- Length: Instructions can have a fixed length (e.g., always 32 bits) or a variable length (e.g., x86 instructions can range from 1 to 15 bytes).
- Number of Instructions: The size of the instruction set can vary greatly from architecture to architecture (some have a small, simple set; others have a very large, complex set).
- Alignment: Instructions may or may not be required to start at memory addresses that are multiples of the architecture's word length (a typical unit of data for the processor).
Control at the Digital Logic Level
Ultimately, the bits in a machine instruction directly control the fundamental circuits of the computer's digital logic level. The instruction tells the CPU's internal control unit which components to activate and how to configure them for that specific clock cycle or sequence of cycles. This includes controlling:
- Registers: Which registers are involved in the operation (source, destination).
Register: A small, high-speed storage location directly within the CPU. Registers are used to hold data that the CPU is actively working on or about to work on, such as operands for calculations, memory addresses, or temporary results. Accessing data in registers is much faster than accessing data in main memory (RAM).
- Bus: How data is moved between components (CPU, memory, I/O devices) via the data and address buses.
Bus: A collection of electrical wires or traces that carry data, addresses, and control signals between different components of a computer system.
- Memory: When and where data is read from or written to main memory (RAM).
- ALU: Which arithmetic or logic operation the ALU should perform (addition, subtraction, AND, OR, etc.) and on which inputs.
- Other Hardware Components: Controlling specialized units or architectural features.
Machine instructions are also created to control specific architectural features beyond basic operations, such as:
- Segment Registers: On architectures like x86 in older modes, segment registers define blocks of memory, and specific instructions are used to load and manage these registers.
- Protected Address Mode: Instructions related to managing memory protection, multitasking, and privilege levels in modern operating systems.
Protected Mode: An operating mode of x86 processors that provides memory protection, multitasking capabilities, and hardware support for virtual memory and paging, preventing programs from interfering with each other or the operating system.
- Binary-Coded Decimal (BCD) Arithmetic: Instructions designed specifically to perform arithmetic on numbers stored in BCD format, which was sometimes used in financial applications to avoid floating-point issues.
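To illustrate what BCD format means in practice, here is a small sketch of packed BCD, in which each decimal digit is stored in its own 4-bit nibble. The helper functions are illustrative and not tied to any particular CPU's BCD instructions.

```python
# Packed binary-coded decimal (BCD): each decimal digit occupies one 4-bit nibble,
# so the decimal number 1234 is stored as the two bytes 0x12 0x34.

def to_packed_bcd(n: int) -> bytes:
    digits = str(n)
    if len(digits) % 2:                    # pad to an even number of digits
        digits = "0" + digits
    return bytes(int(digits[i]) << 4 | int(digits[i + 1])
                 for i in range(0, len(digits), 2))

def from_packed_bcd(data: bytes) -> int:
    return int("".join(f"{b >> 4}{b & 0x0F}" for b in data))

encoded = to_packed_bcd(1234)
print(encoded.hex())                       # '1234': the hex digits mirror the decimal digits
print(from_packed_bcd(encoded))            # 1234
```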
Instruction Format Design Criteria
The design of an instruction set and its formats involves trade-offs, often between performance, code size, and hardware complexity. Criteria considered include:
- Frequency of Use: More commonly used instructions are often designed to be shorter or faster to decode, while less common ones might be longer or more complex.
- Memory Transfer Rate: The speed at which the computer can fetch instructions from memory influences the flexibility of instructions that involve accessing memory. If fetching is slow, variable-length instructions might be less efficient due to complex decoding, favoring fixed-length instructions or more instructions that operate only on registers.
- Address Field Size: This is a significant design decision. The number of bits dedicated to specifying memory addresses directly impacts the amount of memory an instruction can directly reference.
- Trade-off: A shorter address field means instructions are smaller, saving memory space and potentially allowing more instructions to be fetched at once, speeding things up. However, a shorter address field limits the maximum memory address that can be directly accessed by the instruction. For example, a 16-bit address field can directly reference only 2^16 = 65,536 locations, while a 32-bit field can reference over four billion. This might necessitate using additional instructions or registers to access memory beyond that limit.
- Considerations: Designers must consider not only the size of physical memory but also the needs of a virtual address space. Limitations on the size of registers used for addressing also play a role.
Types of Instructions
Machine instructions can often be broadly categorized:
- General-Purpose Instructions: These control fundamental operations common to most computers and necessary for any significant computation. Examples include:
- Data movement (copying data)
- Monadic operations (one operand, e.g., decrement, bitwise NOT)
- Dyadic operations (two operands, e.g., add, subtract, bitwise AND)
- Comparisons and conditional jumps (checking conditions and changing control flow)
- Procedure calls and returns (managing subroutines/functions)
- Loop control (instructions specifically designed to manage loops)
- Input/Output (interacting with peripherals)
- Special-Purpose Instructions: These exploit unique architectural features of a particular computer or family. Examples might include specialized instructions for cryptography, vector processing, or specific hardware control that aren't found on all CPUs.
Machine Code Examples: Different Architectures
Looking at specific architectures helps illustrate the concepts of instruction format and architecture dependence.
IBM 709x (An Older Architecture Example)
The IBM 704, 709, and their successors were early computers (from the 1950s/60s). They illustrate instruction formats from a different era:
- Instructions were stored one per "instruction word" (36 bits).
- Bits within the word had specific meanings. For example, bits 1-11 might be the opcode, while bits 21-35 might be an address (Y).
- They used index registers to modify addresses. The "Tag" field indicated which index register(s) to use for address calculation. The effective address was often calculated as Y - C(T), where C(T) was the content of the selected index register(s).
Index Register: A CPU register used to modify an operand address. The effective address is calculated by adding or subtracting the contents of the index register to/from a base address specified in the instruction. This is fundamental for accessing data in arrays or structures.
- They supported indirect addressing, where the address field in the instruction didn't hold the final data address, but rather the address of another memory location that held the final data address (both mechanisms are sketched in the example after this list).
- They had instructions like Compare Accumulator with Storage (CAS), which would perform a comparison and then skip one or two instructions based on the result (a form of conditional branching).
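The indexed and indirect addressing mechanisms described above can be sketched in a few lines of Python. The memory contents, register values, and numbers used here are invented for illustration and are not actual IBM 709x encodings.

```python
# Sketch of indexed and indirect address calculation, with made-up values.
memory = {200: 42, 500: 200}       # address -> stored value
index_registers = {1: 100, 2: 50}  # index register number -> contents

# Indexed addressing (IBM 709x style): effective address = Y - C(T),
# where Y is the address field and C(T) is the selected index register's contents.
Y, tag = 300, 1
effective_address = Y - index_registers[tag]      # 300 - 100 = 200
print(memory[effective_address])                  # 42

# Indirect addressing: the address field points at a memory word
# that holds the final data address.
pointer_address = 500
final_address = memory[pointer_address]           # 200
print(memory[final_address])                      # 42
```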
These details show how even early machine codes had sophisticated mechanisms for addressing memory and controlling program flow, though the formats and approaches differ from modern CPUs.
MIPS (A Fixed-Length Architecture Example)
The MIPS architecture is often used in educational settings because of its relatively simple, fixed-length instruction format (all instructions are 32 bits long). This makes decoding instructions straightforward.
MIPS instructions are structured into different types (R-type, I-type, J-type), determined by the highest 6 bits (the op field).
31-26 25-21 20-16 15-11 10-6 5-0 (Bit positions)
[ op | rs | rt | rd | shamt| funct ] R-type (Register operations)
[ op | rs | rt | address/immediate ] I-type (Immediate/Data transfer/Branch)
[ op | target address ] J-type (Jump)
- op: Specifies the main operation type.
- rs, rt, rd: Fields specifying source and destination registers involved in the operation.
Operand: The data that an instruction operates on. Operands can be located in registers, in memory, or provided directly within the instruction itself (an immediate value).
- shamt: Shift amount for shift operations.
- funct: For R-type instructions, this field, along with op, specifies the exact operation (e.g., add, subtract, AND, OR).
- address/immediate: A value used directly by the instruction (immediate) or an offset/address.
- target address: The destination address for a jump instruction.
MIPS Instruction Examples:
Add registers 1 and 2, store in register 6: This is an R-type instruction.
- Operation: ADD
- Destination Register (rd): Register 6
- Source Register 1 (rs): Register 1
- Source Register 2 (rt): Register 2
- Shift amount (shamt): 0 (not used for ADD)
- op for ADD is 0 (R-type); funct for ADD is 32 decimal (100000 binary)

Bits:    31-26   25-21  20-16  15-11  10-6   5-0
Binary:  000000  00001  00010  00110  00000  100000
Decimal: 0       1      2      6      0      32
Load word into register 8 from memory: This is an I-type instruction (Load Word, LW). The memory address is calculated by adding an offset (68) to the value in a base register (register 3).
- Operation: LW
- Destination Register (rt): Register 8 (note: LW uses rt as the destination)
- Base Register (rs): Register 3
- Offset/Immediate: 68 decimal
- op for LW is 35 decimal (100011 binary)

Bits:    31-26   25-21  20-16  15-0
Binary:  100011  00011  01000  0000000001000100
Decimal: 35      3      8      68
Jump to address 1024: This is a J-type instruction.
- Operation: JUMP
- Target Address: 1024 (This value is part of the instruction, adjusted by the CPU to form the full 32-bit address).
- op for JUMP is 2 decimal (000010 binary)

Bits:    31-26   25-0
Binary:  000010  00000000000000010000000000
Decimal: 2       1024
(Note: The target address field in J-type is 26 bits, representing the word address which is then shifted and combined with higher bits of the Program Counter.)
These examples show how different parts of the binary instruction directly encode the operation, the registers involved, and values or addresses needed by the instruction.
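Given the field layout described above, the three example encodings can be reproduced with a little bit shifting. This is a minimal sketch assuming the standard MIPS-32 field widths; the helper function names are invented for illustration.

```python
# Minimal MIPS-32 instruction encoders, assuming the standard field widths
# (6/5/5/5/5/6 bits for R-type, 6/5/5/16 for I-type, 6/26 for J-type).

def encode_r_type(op, rs, rt, rd, shamt, funct):
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i_type(op, rs, rt, immediate):
    return (op << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)

def encode_j_type(op, target):
    return (op << 26) | (target & 0x03FFFFFF)

# ADD: add registers 1 and 2, store in register 6 (op=0, funct=32)
add = encode_r_type(op=0, rs=1, rt=2, rd=6, shamt=0, funct=32)
# LW: load word into register 8 from offset 68 off register 3 (op=35)
lw = encode_i_type(op=35, rs=3, rt=8, immediate=68)
# J: jump to 1024 (op=2)
j = encode_j_type(op=2, target=1024)

for name, word in [("ADD", add), ("LW", lw), ("J", j)]:
    print(f"{name}: {word:032b}  (0x{word:08X})")
```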
Overlapping Instructions
On some processor architectures, particularly those with variable-length instruction sets (like Intel x86), it's possible to construct sequences of bytes that can be interpreted as valid machine instructions starting at different byte offsets. This creates "overlapping instructions" or "overlapping opcodes."
This technique, also known as "instruction scission" or "jump into the middle of an instruction," involves carefully arranging instruction bytes so that a sequence of bytes B1 B2 B3 B4 might be interpreted as Instruction A (starting at B1) and also Instruction B (starting at B2), where Instruction A uses bytes B1 B2 B3 and Instruction B uses bytes B2 B3 B4.
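As a concrete illustration, the five x86 bytes below decode to a single five-byte instruction when execution starts at the first byte, and to four one-byte NOP instructions when execution starts at the second byte. The sketch uses the third-party Capstone disassembler (installable with pip install capstone); any disassembler would show the same effect.

```python
# Overlapping x86 instructions: the same bytes decoded at two different offsets.
# 0x05 is the opcode for "add eax, imm32" and 0x90 is "nop", so:
#   starting at byte 0: add eax, 0x90909090   (one 5-byte instruction)
#   starting at byte 1: nop; nop; nop; nop    (four 1-byte instructions)
from capstone import Cs, CS_ARCH_X86, CS_MODE_32   # third-party: pip install capstone

code = bytes([0x05, 0x90, 0x90, 0x90, 0x90])
md = Cs(CS_ARCH_X86, CS_MODE_32)

for start in (0, 1):
    print(f"decoding from offset {start}:")
    for insn in md.disasm(code[start:], 0x1000 + start):
        print(f"  {insn.address:#x}: {insn.mnemonic} {insn.op_str}")
```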
- Historical Use: In the past (like the 1970s/80s), this was sometimes done to save precious memory space, particularly in very constrained environments like early personal computers (e.g., error tables in Microsoft's Altair BASIC).
- Modern Use Cases:
- Extreme Optimization: Still occasionally used in places where every byte counts, such as the tiny programs stored in boot sectors (boot loaders).
- Code Obfuscation: Can make disassembled code harder to read and analyze, used to protect intellectual property or hinder tampering.
- Security Exploits: This property is exploited in techniques like Return-Oriented Programming (ROP). Attackers chain together small, existing instruction sequences ("gadgets") found within legitimate programs by manipulating the program's execution stack to jump into the middle or beginning of these sequences.
While fascinating from a low-level perspective, deliberately creating overlapping instructions is complex and rarely necessary in most modern software development.
Relationship to Other Low-Level Concepts
Understanding machine code is clearer when contrasted with related low-level concepts:
Machine Code vs. Microcode
Microcode: A layer of control logic within some CPUs, where complex machine code instructions are actually implemented as a sequence of simpler, more primitive operations called microinstructions. Microcode acts like a built-in interpreter for the machine code instructions.
Some processors (especially complex instruction set computers, or CISC) use microcode. In these architectures, when the CPU fetches a machine code instruction, it doesn't execute it directly as a single fundamental hardware operation. Instead, the control unit looks up the instruction in a microcode ROM (Read-Only Memory), and the CPU executes a sequence of microinstructions defined there. This allows complex machine instructions to be implemented using simpler hardware logic and facilitates building a family of processors with the same ISA but different underlying hardware by changing the microcode. Reduced Instruction Set Computers (RISC) typically execute most machine instructions directly in hardware without a microcode layer.
Machine Code vs. Bytecode
Bytecode (or p-code): An intermediate representation of a program, designed to be executed by a software interpreter or a virtual machine, rather than directly by a physical CPU. It is often more portable than machine code.
Bytecode sits at a level above machine code. Languages like Java compile source code into Java bytecode (.class files), which is then run by the Java Virtual Machine (JVM). The JVM interprets the bytecode or uses a Just-In-Time (JIT) compiler to translate parts of the bytecode into the native machine code of the host computer while the program is running.
The key difference is that machine code is designed for a specific physical CPU, while bytecode is designed for a virtual execution environment. The main exception is if a processor is specifically designed to execute a particular bytecode format directly as its machine code (like some specialized Java processors, which are rare).
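Python itself offers a convenient way to see this intermediate layer: the standard-library dis module prints the CPython bytecode for a function, which the CPython virtual machine then interprets rather than handing the CPU native machine code directly.

```python
# Inspecting bytecode (not machine code) with Python's standard-library dis module.
import dis

def add_one(x):
    return x + 1

# Prints CPython bytecode such as LOAD_FAST and RETURN_VALUE (exact opcode names
# vary by Python version). These opcodes are executed by the CPython virtual
# machine, not decoded by the CPU hardware; that is the key difference from
# machine code.
dis.dis(add_one)
```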
Native Code: A term often used to refer to machine code or assembly code compiled for a specific processor and operating system platform. When a program or library part is described as "native," it means it's compiled into machine code for that platform, as opposed to being interpreted or running on a virtual machine.
Storing and Executing Machine Code in Memory
When a program runs, its machine code instructions must be loaded into the computer's main memory (RAM).
RAM (Random Access Memory): The primary working memory of a computer, used to store data and machine code instructions that the CPU is actively using.
The CPU fetches instructions from RAM. To speed up access, copies of frequently used instructions and data are often stored in smaller, faster memory located closer to the CPU, called caches. Many modern CPUs have separate caches for instructions (instruction cache) and data (data cache).
The Program Counter
The CPU keeps track of which instruction it is currently executing (or about to execute) using a special register called the Program Counter (PC).
Program Counter (PC): A CPU register that holds the memory address of the next instruction to be fetched and executed. It is automatically incremented after fetching most instructions to point to the subsequent instruction in sequence.
Normally, after executing an instruction, the PC is incremented to point to the very next instruction in memory. This results in sequential execution. However, instructions like JUMP, BRANCH, or CALL explicitly modify the PC to a different memory address, allowing the program to branch, loop, or call subroutines.
When a computer is powered on, the CPU's Program Counter is typically set to a predefined, hard-coded memory address (often located in firmware like the BIOS or UEFI). The CPU then begins executing the machine code found at that address. This initial code is responsible for starting the boot process.
Crucially, the CPU will attempt to execute whatever is at the memory address pointed to by the PC. If that memory location doesn't contain valid machine code, it will likely cause an error or "fault."
Memory Protection
Modern operating systems and CPU architectures implement mechanisms to prevent unauthorized access to memory and to ensure that only valid machine code is executed.
- Paging: Operating systems use paging to divide memory into fixed-size blocks called pages. Each page is assigned permissions.
Execute Bit: A permission flag associated with a memory page in paging systems. If set, it indicates that machine code stored in this page can be executed by the CPU. If not set, attempting to execute code in this page will trigger a protection fault.
Operating systems use permissions like "readable," "writable," and "executable." The execute bit is vital: if a page containing data (like user input or variables) is marked as non-executable, the CPU hardware will prevent the PC from being directed into that page for execution. Attempting to execute data triggers a protection fault (like a "Segmentation Fault" or "Access Violation"), which the operating system typically handles by terminating the offending program. This is a fundamental security measure against many types of exploits that try to inject and run malicious code. System calls (mprotect() on Unix-like systems, VirtualProtect() on Windows) allow programs to change these permissions under OS control; a small demonstration follows this list.
- Segmentation: Older architectures or modes (like x86 segmented mode) use segments instead of pages.
Segment Descriptor: In a segment-based memory system, a structure that defines the properties of a memory segment, including its base address, size, and access permissions (read, write, execute). Segment descriptors specify whether a segment contains executable code and the privilege level ("ring") at which that code is allowed to run.
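To see the connection between page permissions and machine code execution, here is a minimal sketch that maps a page with read, write, and execute permission, copies a few raw instruction bytes into it, and calls them as a function. It assumes an x86-64 Linux system (the bytes encode mov eax, 42 followed by ret) and uses only the standard-library mmap and ctypes modules; on a page mapped without PROT_EXEC, directing execution into it would instead trigger a protection fault.

```python
# Executing raw machine code from Python on x86-64 Linux (illustrative sketch).
import ctypes
import mmap

# x86-64 machine code for:  mov eax, 42 ; ret
CODE = bytes([0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3])

# Map one page with read+write+execute permission (the "execute bit" is set).
page = mmap.mmap(-1, mmap.PAGESIZE,
                 prot=mmap.PROT_READ | mmap.PROT_WRITE | mmap.PROT_EXEC)
page.write(CODE)

# Treat the start of the page as a C function returning an int, then call it.
address = ctypes.addressof(ctypes.c_char.from_buffer(page))
func = ctypes.CFUNCTYPE(ctypes.c_int)(address)
print(func())   # 42

# If the page had been mapped without PROT_EXEC, pointing the program counter
# into it would raise a protection fault (a segmentation fault) instead.
```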
Code Space
Within the memory space allocated to a running program (a process), the portion dedicated to storing the program's machine code instructions is often referred to as the code space or text segment. In multitasking operating systems, multiple processes can be running concurrently. If these processes use the same program (e.g., multiple users running the same web browser application), they can often share the same copy of the machine code in memory, reducing memory usage.
In multi-threading environments, where a single process has multiple threads of execution running concurrently, these threads typically share the entire memory space of the process, including the code space and data space. This sharing is why switching between threads within the same process is generally much faster (lower context switching overhead) than switching between different processes.
Readability by Humans and Debugging Tools
As Douglas Hofstadter observed, trying to read raw machine code is akin to examining the atoms of a DNA molecule – you see the fundamental components (the bits), but the overall structure, meaning, and purpose are obscured.
However, tools exist to make machine code understandable to humans:
- Disassemblers: These programs perform disassembly, translating machine code back into its equivalent assembly language representation. Since there is a one-to-one mapping between most machine instructions and assembly mnemonics, this is a relatively straightforward process. The output is still low-level but uses mnemonics and often shows numerical addresses and values in hexadecimal, which is more human-friendly than binary.
Disassembly: The process of translating machine code into assembly language.
- Decompilers: For machine code that was originally compiled from a high-level language, a decompiler attempts to translate it back into source code in a higher-level language. This is a much more complex task than disassembly because much of the high-level structure (variable names, data types, loops, functions, control flow) is lost or transformed during compilation. Decompiled code is often functionally similar but may be difficult to read and may not perfectly match the original source code (it can be "obfuscated").
- Debuggers: These are essential tools for understanding and troubleshooting program execution at a low level. Debuggers allow a programmer to:
- Execute a program one instruction at a time.
- Inspect the contents of CPU registers and memory.
- Set breakpoints to pause execution at specific points.
- Often, debuggers can display the currently executing machine code alongside its disassembled representation.
Debuggers rely heavily on symbol tables to make the low-level execution understandable in terms of the original source code.
Symbol Table: A data structure generated during compilation or assembly that maps symbolic names (like variable names, function names, labels) used in the source code or assembly code to their corresponding numerical addresses or values in the compiled or assembled machine code.
Symbol tables contain debug symbols. If the symbol table is available (it can be embedded in the executable file or stored in a separate file), the debugger can use it to show you, for instance, that memory address 0x1A4C corresponds to the variable player_score or that the code at 0x2000 is the start of the calculate_total() function. This makes debugging machine code much more manageable.
Examples of symbol table formats and storage:
- Early systems like the SHARE Operating System (1959) for IBM mainframes had formats like SQUOZE which included symbol tables.
- Modern IBM mainframe OS (z/OS) use ADATA (Associated data) files.
- Microsoft Windows uses Program Database (.pdb) files.
- Unix-like systems use formats like stabs and DWARF (macOS uses .dSYM files for DWARF symbols).
Understanding machine code provides profound insight into how software interacts with hardware at the most fundamental level. While you won't be writing large applications directly in machine code, comprehending its nature, structure, and relationship to the CPU's operations is indispensable knowledge for anyone delving into the "Lost Art of Building a Computer from Scratch."